ACTIVE SUBSPACE METHODS IN THEORY AND PRACTICE: 
APPLICATIONS TO KRIGING SURFACES 

PAUL G. CONSTANTINE*, ERIC DOWt, AND QIQI WANG* 

Abstract. Many multivariate functions encountered in practical engineering models vary pri- 
marily along a few directions in the space of input parameters. When these directions correspond 
to coordinate directions, one may apply global sensitivity measures to determine the most influen- 
tial parameters. However, these methods perform poorly when the directions of variability are not 
aligned with the natural coordinates of the input space. We present a method to flrst detect the 
directions of variability using evaluations of the gradient and subsequently exploit these directions 
to construct a response surface on a low dimensional subspace - i.e., the active subspace - of the 
inputs. We develop a rigorous theoretical framework with computable error bounds, and we link the 
vN theoretical quantities to the parameters of a kriging response surface on the active subspace. We 

^ apply the method to a nonlinear heat transfer model on a turbine blade with a 250-parameter heat 

^ I flux model representing uncertain transition to turbulence, and we compare its performance to other 

common approaches for input dimension reduction. 
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1. Introduction S^ motivation. As computational models of physical systems 
become more complex, the need increases for uncertainty quantification (UQ) to en- 
r"| able defensible predictions. Monte Carlo methods are the workhorse of UQ, where 

model inputs are sampled according to a characterization of their uncertainty and 
the corresponding model outputs are treated as a data set for statistical analysis. 
However, the slow convergence of Monte Carlo methods coupled with the high com- 
putational cost of the models has led many to employ response surfaces trained on 
a few carefully selected runs in place of the full model. This strategy has had great 
^ success in forward and inverse uncertainty propagation problems [24l [28l [10] as well 

as optimization [21j. However, most response surfaces suffer from the curse of dimen- 
(^ sionality, where the cost of constructing an accurate surface increases exponentially 

^S| as the dimension (i.e., the number of input parameters) increases. 

_il To make construction tractable, one may first perform sensitivity analysis [35^ to 

C^ determine which variables have the most influence on the model predictions. With 

CO a ranking of the inputs, one may construct response surfaces that focus on the most 

^"^ influential variables, e.g., through a suitably anisotropic design; the same concept 

^ applies to mesh refinement strategies for solving PDEs. Methods for sensitivity anal- 

ysis are typically classified as local perturbation or global methods. Local methods 
perturb one input at a time around a nominal value and measure the effects on the 



^ outputs. These methods are relatively inexpensive but lack robustness in the pres- 

ence of noisy data. Local methods are fraught with other difficulties, e.g., how much 
should one perturb the input? And how can one be sure that the effects measured 
at the nominal condition are similar elsewhere in the parameter space? Global meth- 
ods address these issues by providing integrated measures of the output's variability 
over the full range of parameters; consequently, they are computationally more ex- 
pensive. Methods based on variance decompositions [30l |35] require approximating 
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high-dimensional integrals in order to rank the inputs. 

Both classes of methods rank the coordinates of the inputs. However, some models 
may vary primarily along directions of the input space that are not aligned with the 
coordinate system. The analogy of inputs as knobs is useful here; turning two knobs 
simultaneously in some relative proportion may have a greater effect than turning 
only the most influential knob. Consider the function exp(axi -\-bx2) shown in Figure 



5.1 with particular values of a and b. Changing (xi,X2) proportional to (a, 6) will 
have the greatest effect, while changing the variables proportional to (6, —a) will have 
no effect on the function. In effect, the function has one direction of variability and 
an orthogonal direction of flatness. It would be useful to know such directions for 
a given model with many inputs, where we could construct a response surface only 
along the directions of variability, i.e., on a subspace of model inputs. 

We propose a method based on gradient evaluations for detecting and exploiting 
the directions of variability of a given function to construct an approximation on a 
low-dimensional subspace of the function's inputs. Given continued interest in gradi- 
ent computations based on adjoint methods [5l[T9] and algorithmic differentiation [16j , 
it is not unreasonable to assume that one has access to the gradient of the function. 
We detect the directions by evaluating the function's gradient at a set of input points 
and determining a rotation of the input space that separates the directions of relative 
variability from directions of relative flatness. We exploit these directions by first pro- 
jecting the input space to the low-dimensional subspace that captures the function's 
variability and then approximating the function on the subspace. Following Russi's 
2010 Ph.D. thesis [33], we call this low-dimensional subspace the active subspace. 

Subspace approximations are commonly used in optimization, where local quadratic 
models of a function are decomposed to reveal search directions. They are also found in 
many areas of model reduction [2 and optimal control [38 , where a high-dimensional 
state space vector is approximated by a linear combination of relatively few basis 
vectors. Common to both of these fields are methods for matrix factorizations and 
eigenvalue computations [34 , which are replete with subspace oriented approaches. 
The use of subspace methods for approximating high-dimensional functions arising in 
science and engineering models appears rare by comparison. Recent work by Lieber- 
man et al [25 describes a method for finding a subspace in a high-dimensional input 
space via a greedy optimization procedure. Russi [33] proposes a method for discover- 
ing the active subspace of a function using evaluations of the gradient and constructing 
a quadratic response surface on the subspace; his methodology is similar to ours. Re- 
cently Fornasier et al [13] analyzed subspace approximation algorithms that do not 
need gradient evaluations but make strong assumptions on the function they are ap- 
proximating; they take advantage of results from compressed sensing. Our previous 
work has applied the active subspace method to design optimization [7l [12] , inverse 
analysis [10 , and spatial sensitivity [TT] . 

The contribution of this paper is two-fold. First we provide a rigorous theoreti- 
cal foundation for gradient-based dimension reduction and subspace approximation. 
We construct and factorize a covariance-like matrix of the gradient to determine the 
directions of variability. We then construct a low-dimensional best approximation of 
the function via conditional expectation. We derive an error bound for the approx- 
imation in terms of the eigenvalues of the gradient covariance matrix along with a 
corollary that justifies our computational procedure. Second, we provide a bridge be- 
tween the theoretical analysis and computational practice by (i) relating the derived 
error bounds to SVD-based approaches for discovering the active subspace and (ii) 
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heuristically linking the theoretical quantities to the parameters of a kriging response 
surface constructed on the active subspace. We apply this procedure to a nonlinear 
heat transfer model on a turbine blade, where a 250 parameter model for heat flux on 
the blade surface represents an uncertain transition zone from laminar to turbulent 
flow. We compare the subspace dimension reduction to other common strategies for 
sensitivity-based dimension reduction. 

2. Theoretical framework. In this section we construct the theoretical frame- 
work for approximating a multivariate function on the active subspace including error 
bounds. We perform the analysis using techniques from probability theory, but we 
emphasize that there is nothing inherently stochastic about the functions or the ap- 
proximations. We will use a subscript to denote conditional expectation. For example, 
if / = /(y, z), then we use Ey [/] to denote the conditional expectation of / given y. 
We will also use the subscript for the gradient operator to denote partial derivatives 
with respect to particular variables. 

2.1. Directions of variability and the active subspace. Consider the Tri- 
dimensional function 

/ = /(x), xeA-CR-. (2.1) 

Let the space X be equipped with a probability density function p = p(x). We assume 
that / is continuous, different iable, and square integrable with respect to p. Denote 
the gradient of / by the column vector 



Vx/(x) = 






df 

OXyti 



(2.2) 



Define the m x m matrix C by 

C = E [(Vx/) (Vx/)^] , (2.3) 

where we assume that / is such that C exists. C can be interpreted as the uncentered 
covariance of the gradient vector. Note that C is symmetric and positive semidefinite, 
so it admits a real eigenvalue decomposition 

C = WAW^, A = diag (Ai, . . . , A^), Ai > • • • > A^ > 0. (2.4) 

The following lemma quantifies the relationship between the gradient of / and the 
eigendecomposition of C. 

Lemma 2.1. The mean squared gradient of f along the direction given by the 
eigenvector w^ is equal to the corresponding eigenvalue, 

E [((Vx/)^Wi)'] = Xi (2.5) 

Proof. By the definition of C, 
A, = wf Cwi = wf (E [(Vx/) (Vx/)T) w, = E [{{V^ffwif] , (2.6) 

as required. D 
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Lemma [2?!] implies, for example, that if A^ is zero, then / is flat along the direction 
w^ in R^. The essence of the low dimensional approximation is to detect and remove 
directions along which / varies relatively little. 

The eigenvectors W define a rotation of R^ and consequently the domain of 
/. With eigenvalues in decreasing order, we can separate components of the rotated 
coordinate system into a space which contains the majority of /'s variability - i.e., 
the active subspace - and its orthogonal complement along which / is on average 
relatively fiat. The eigenvalues and eigenvectors are partitioned as 



A 



Ai 

A2 



W = [Wi W2] (2.7) 



where Ai = diag (Ai, . . . , A^) with n < m, and Wi is m x n. Define the rotated 
coordinates [y,z]^ by 

y = Wfx, z = W|^x. (2.8) 

Then we have the following lemma. 

Lemma 2.2. The mean squared gradients of f with respect to the coordinates 
[y,z]^ satisfy 



E[(Vy/)^(Vy/)]=Al + ---+An 

E [(Vz/)^(Vz/)] =An+i + --- + A, 



(2.9) 



Proof First note that we can write 

/(x) = /(WW^x) 

= /(WiWf X + WsWi^x) (2.10) 

= /(Wiy + W2z) 

By the chain rule, the gradient of / with respect to y can be written 

Vy/(x) = Vy/(Wiy + W2z) = WfVx/(Wiy + W2z) = WfVx/(x). (2.11) 
Then 

E [(Vy/)^(Vy/)] =E [trace ((Vy/)(Vy/)^)] 
= trace (E[(Vy/)(Vy/)^]) 



(2.12) 



= trace (Wf E [(Vx/)(Vx/)^] Wi) 
= trace (Wf CWi) 
= trace (Ai) 
= Ai + • • • + An, 

as required. The derivation for the z components is similar. D 

Lemma [2^ motivates the use of the label active subspace. In particular, / varies 
more on average in the subspace defined by the columns of Wi than in the subspace 
defined by W2, as quantified by the eigenvalues of C. 
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2.1.1. Two important special cases. There are two cases that show up in 
practice where the rank of C may be determined a priori. The first is a ridge func- 
tion [8], which has the form 



/(x) = /i(a^x), 



(2.13) 



where /i is a univariate function, and a is a constant ?7i-vector. In this case, C is 
rank one, and the eigenvector defining the active subspace is a/||a||, which can be 
discovered by a single evaluation of the gradient anywhere in A'. The function shown 
in Figure |5.1| is an example of a ridge function. 

The second special case is a function of the form 

/(x) = /i(x^Ax), (2.14) 

where /i is a univariate function and A is a symmetric m x m matrix. In this case 

C = AE [(/i'(x^Ax))2xx^] A^, (2.15) 

where h' is the derivative of h. This implies that the null space of C is the null space 
of A provided that h' is non-degenerate. 

2.2. Approximation on the active subspace. We can now construct a func- 
tion of n variables that approximates /. Before formally defining such a function, we 



offer some intuition based on the decomposition in (2.10). Since perturbations in z 



change / relatively little on average, we would like to have a function of only y that 
approximates /. For a fixed y, the best guess one could make for / is the mean of / 
over all values of z. This is precisely the conditional expectation of / given y. Define 
the n-variate function /n(y) by 



fn = /„(y) = Ey [/] . 



(2.16) 



Since /„ is a conditional expectation, it is the best approximation of / given y |371 
Chapter 9] . The domain of /„ is 



3; = {y : y = Wf X, x e A"} 



(2.17) 



The next theorem bounds the error of fn in terms of the eigenvalues of C from (2.3). 



Theorem 1. The mean squared error of fn defined in (2.16) is bounded by 
E [(/ - fnf] < Cp (A„+i - 



+ \m) 



(2.18) 



where Cp is a constant that depends only on the domain X and the weight function 
P- 



Proof. Note that Ey [/ - /„] = by the definition ( |2.16[ ). Thus, 

E[(/-/„)']=E[Ey[(/-/„)2]] 

<CpE[Ey[(V./)^(V./)]] 
= CpE[(V./)^(V./)] 

= Cp (An+1 + • • • + A^). 



(2.19) 
(2.20) 
(2.21) 
(2.22) 



Lines (2.19) and (2.21) are due to the tower property of conditional expectations. 



Line (2.20) is a Poincare inequality, where the constant Cp depends only on X and 



the density function p. Line (2.22) follows from Lemma 2.2 D 
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Theorem [l] provides an error bound for the approximation fn- However, fn may 
be very difficult to compute in practice since every evaluation given y requires a high 
dimensional integral over the coordinates z. We would like to use a single evaluation 
of /(x) to approximate /n(y) with y = Wf x. In the following corollary, we show 
that such a strategy is well motivated if the variation of / along the coordinates z is 
sufficiently small. 

Corollary 2.3. Given e > 0, the probability that f deviates from fn by e is 
bounded by 

^[\f-fn\>e] < ^ (An+1 + • • • + A^) (2.23) 

where Cp is the Poincare constant from Theorem\^ 

Proof. This follows directly from the Chebyshev inequality and Theorem [l] D 

3. Computational framework. In this section we present a computational 
framework for constructing an approximation to /n(y) = Ey [/]; section 3.3 contains 
a step-by-step computational algorithm, and eager readers may begin there. We 
assume that it is possible to evaluate the gradient Vx/ at a given point x G A'. This 
is not an unreasonable assumption given the increased interest in adjoint methods for 
computing sensitivities [I9l [5 and algorithmic differentiation [16] . If derivatives are 
not available, then one can in principle use finite differences to evaluate the gradient. 
However, this approach requires tti + 1 evaluations of / for every evaluation of the 
gradient, which may be infeasible in many applications. Also, finite difference methods 
fail when applied to noisy function evaluations. An intriguing alternative can be found 
in [13^, where derivatives are not required but the function must have special structure. 

3.1. Discovering the active subspace. We must first compute the eigenvec- 



tors W and eigenvalues A of the matrix C from (2.4). We immediately encounter 
the obstacle of computing the elements of C, which are integrals over the high di- 
menisonal space X. Thus, tensor product numerical quadrature rules are impractical. 
We opt for Monte Carlo integration, which will yield its own appealing interpretation. 
In particular, let 

Vx/, = Vx/(x,), X, gA', j = l,...,M, (3.1) 

be independently computed samples of the gradient vector, where Xj is drawn from the 
density p on X. Note that in practice computing the gradient at Xj typically involves 
first evaluating fj = fi'^j)] we will use these function evaluations when training and 
testing the response surface. With the samples of the gradient, we approximate 

1 ^ 
^ ^ ^ = ]^E(^x/,)(Vx/,r, (3.2) 

and compute the eigenvalue decomposition 

C = WAW^. (3.3) 

The size ofCismxm^ where we expect m to be on the order of hundreds or thousands 
corresponding to the number of variables x. Thus we anticipate no issues computing 
the complete eigendecomposition of C on a modern personal computer. 
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There is another interpretation of the samphng approach to approximate the 
eigenpairs of C. We can write 



GG^, (3.4) 



where the m x M matrix G is 



G = ^ [Vx/i • • • Vx/m] . (3.5) 



M 

If we compute the singular value decomposition (SVD) G = USV^, where U has 
dimension m x m and V has dimension M x m^ then 

C = GG^ = (UEV^) (VS^U^) = US^U^. (3.6) 

Thus, U = W up to a change in sign, and S^ = A. This provides an alternative 
computational approach via the SVD. Again, we stress that the number of variables 
m and the number of gradient samples M are sufficiently small in many applications 
of interest so that the SVD can easily be computed on a modern personal computer. 
More importantly, the SVD shows that the rotation matrix W can be interpreted as 
the uncentered principal directions ^20j from an ensemble of gradient evaluations. 

In practice, we use W and A in place of W and A. We will omit the distinction 
between the two in the remainder of the paper. The errors in this approximation 
may be significant, particularly if the number M of samples of the gradient is too 
small. However, a thorough analysis of these errors is beyond the scope of the present 
work. We anticipate that results from classical perturbation theory for eigenvalues 
and eigenvectors [29 will prove useful. 

3.2. Building a response surface. In this section, we detail a procedure to 
construct a response surface on the n-dimensional active subspace defined by Wi from 



(2.7). At this point, we assume that we have the eigenvectors W and eigenvalues A 
from the Monte Carlo estimate of C described in the previous section. 

We must first choose the dimension n of the subspace. Many reduction methods 
use the magnitude of the A^ to define n, e.g., so that Ai + • • • + A^ exceeds some 
proportion of Ai + • • • + A^. This, however, is not our aim. We are bound instead 
by more practical considerations, such as choosing n small enough to construct a 
reasonable set of training sites (e.g., a mesh) for the response surface; the trailing 
eigenvalues A^+i, • • • , Xm then inform a noise model for the response surface. A rapid 
decay in the A^ implies that the low dimensional approximation is relatively more 
accurate (see Theorem fl]), but it is not necessary to apply the reduction method. 



On the reduced domain y in (2.17), we seek a function fn ^ /n, where fn is 



the conditional expecation defined in (2.16). Global or piecewise polynomial approx- 



imation offers one possible construction. In this work, we will use a kriging response 
surface [22] (also known as Gaussian process approximation ^ and closely related 
to radial basis approximation [36]). The primary reason for this choice is that the 
computed A^ can be used to inform the parameters of the kriging surface - both a 
variance model for the training data and the correlation lengths along each of the 
reduced coordinates. We demonstrate this in Section [3.2.31 

Once the function fn is trained, the following procedure is used to approximate 

/■ 

1. Given x G A', compute y = Wf x. 

2. Evaluate the response surface /n(y)- 
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3. Set /(x) = /„(y).^ 
In the context of kriging, /n(y) is interpreted as the expectation of a Gaussian random 
variable, and the variance of the prediction is provided as weh. 



3.2.1. The reduced domain. The domain of /n(y) is y from (2.17). If the 
domain X of /(x) is unbounded and equal to R^, then the reduced domain y is 
also unbounded and equal to R^. In such a case, constructing the reduced domain 
is straightforward; it is simply a rotation followed by a coordinate projection applied 
to A'. An important special case is when A' is unbounded and the weight function 
p(x) is a Gaussian probability density function. In this case, the projected density 
function is also Gaussian. We do not pursue the case of unbounded domains further. 
However, the techniques we use to construct the response surface on the subspace for 
bounded domains are similar for unbounded domains. 

If A' is compact and p(x) has compact support, then one must take care when 
constructing the reduced domain y. We will specialize to the case where A' is a 
hypercube [—1, 1]^, which we write as 

A' = {x : -1 <x< 1}. (3.7) 

Mathematically, this restriction is not necessary. But it is much more convenient 
computationally, since finding a point in A' that corresponds to a given point in y 
becomes a standard linear program [27]. 

When A' is a hypercube, the reduced domain y can be found by first projecting 
each corner of A' to R"^ by Wi and taking the convex hull. We write this as 

y = conv ({y : y = Wfx, x G {-1, 1}™}) , (3.8) 

where { — 1, 1}^ is the space of TTi-vectors with entries equal to -1 or 1. Figure [5.2a| 
shows the two dimensional projection of a three dimensional cube (blue) and the 
convex hull of its corners (green). However, this approach requires finding the convex 
hull of 2^ points in R"^, which may be intractable if m is too large. 

Instead, we propose to create a bounding box for y as follows. For each column 
w^ from Wi , compute upper and lower bounds for the corresponding reduced variable 
i/i with a bound constrained linear program 

Vi = minimum(wfx), ^J' = -yl (3.9) 

— 1<X<1 

In fact, the linear program has a closed form solution yl = — w^sign(w^), where 
sign(wi) returns an tti- vector with the sign of each component of w^. However, we 
find the linear program formulation more intuitive. 
The bounding box is then given by 

3) = {y : yGR", y, <y<yj, (3.10) 



5.2a 



where y/ = [?/[,..., yl^]^ and ju = [^i , . . . , y^]^- By construction y ^ y. Figure 
shows the two dimensional bounding box (red) for the projected three dimensional 
cube (blue). 

3.2.2. Training data on the reduced domain. To create training data for 
the kriging surface on the active subspace, we must evaluate the function /^ at a set 
of points in the bounding box domain y. We already have M evaluations fj that 
exist at the points jj = Wfxj, which were computed during the sampling of the 
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gradient (Section 3.1). It is natural to want to use the fj when training the kriging 
surface on the subspace to avoid unnecessary evaluations of /. However, there is no 
guarantee that the points y^ comprise a good design on the subspace; they will likely 
cluster together, which causes difficulties for the conditioning of the kriging surface. 
Therefore, we set the existing function evaluations aside during the training and use 
them instead to test the quality of the response surface. 

To ensure a good design on the reduced domain, we choose it independently of 
the Xj used to sample the gradient of /. Let y^ with /c = 1, . . . , A^ be an initial set 
of design points in the bounding box domain y chosen to enable a well-conditioned 
training step. There are many options for choosing the points of an experimental 



design, many of which are detailed in [22 . In Figure 5.2a, we show the points of a 
tensor grid of uniformly spaced points in red as an example. 

For each y/e, we want to find a point x* in the full ?7i- dimensional domain A' and 
set the training data equal to /(x*). However, not every design site on the bounding 
box y will correspond to a point in A'. We must therefore remove sites from the 
initial design on y that do not correspond to points in A'. For the hypercube case, 
this involves solving Phase 1 of a linear program [27] for each point in the design. In 
particular, for yj^ G 3^, we seek x* that satisfies 

yk = Wfx*, -l<x*<l. (3.11) 

If no such X* exists, then the linear program solver will indicate that the constraint 
set is infeasible. In this case we remove y^ from the initial design and decrement N 
by one. We are left with a subset of the initial design sites where we are guaranteed 
that each remaining site corresponds to at least one point in A'. The initial design 



on y is shown in Figure 5.2b the remaining points after pruning are shown in Figure 



5.2c with the true reduced domain in green for reference. 



For each remaining y/^, we find x* that satisfies (3.11) and compute the training 
data 

fn,k = fniYk) = /(X*). (3.12) 

There may be an infinite number of such x* , each corresponding to a different value 



of the coordinates z. By Corollary 2.3, we are assured that the value /(x*) is unlikely 
to deviate too far from its conditional mean /n(y/c)- Motivated by the error bounds 
in Theorem IT] we will build a noise model that represents the deviation of /(x*) from 
fn{yk)' And we will use this noise model when constructing a kriging surface. 

3.2.3. Training the kriging response surface. At this point we have values 
fn^k foi" each design site y^ ^ y that we will use to train the kriging surface. We 
assume that the function we are trying to approximate is smooth with respect to 
the coordinates y. However, since the training data fn^k are not exactly equal to 
the conditional expectation fn{yk)^ we do not want to force the kriging surface to 
exactly interpolate the training data. Instead, we want to build a model for the 
noise in the training data that is motivated by Theorem [l] and incorporate it into 
the kriging surface. We therefore choose the correlation matrix of the training data 
to be a product-type squared exponential kernel with an additional diagonal term to 
represent the noise. 

Gov /n(y/ci), /n(yfe2) = K{yk^,yk^)^rf8{ki,k2) (3.13) 
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where S{ki, k2) is 1 if ki = k2 and zero otherwise, and 



K{yki,yk2) = exp 



y^_(^fei,i -yk2^f 



2e 



(3.14) 



Along with the correlation function, we choose a quadratic mean term, which will add 
{^^ ) polynomial basis functions to the kriging approximation. 

We are left to determine the parameters of the kriging surface, including the 
correlation lengths ^^ from (3.14) and the parameter rf' of the noise model from 
(3.13). Given values for these parameters, the coefficients of both the polynomial 
bases and the linear combination of the training data are computed in the standard 
way [22l[32]. We could also use a standard maximum likelihood method to compute 
(.i and rf . However, we can inform these parameters using the quantities from Lemma 
Oand Theorem [H 

Toward this end, we approximate the directional derivative of / along w^ with a 
finite difference, 



Vx/(x)^w, 



1 



(/(x + (5wO-/(x)), 



(3.15) 



which is valid for small 5. Decompose /(x) = /o + /'(x), where /o = E [/], so that f 



has zero- mean. By Lemma 2.1 



A, = E [((Vx/)^w,)2] 
1 



J2 
1 

J2 

2 



(/(x + (5wi)-/(x))^ 
(/'(x + ,5w,)-/'(x))' 



(3.16) 



-^2(<t2-Cov[/'(x + ,5w,), /'(x)]), 
where a^ = Var [/]. Rearranged, we have 

Corr[/'(x + .5w,),/'(x)] = ^Cov [/'(x + 5w,), /'(x)] = 1 " ^• 



(3.17) 



This implies that the correlation function along Wj is locally quadratic near the origin 
with coefficient \i/2(j'^ . A univariate Gaussian correlation function with correlation 
length parameter t has a Taylor series about the origin 



exp 



'2P 



= 1 



2£2 



(3.18) 



Comparing these terms to the locally quadratic approximation of the correlation func- 
tion, we can use a univariate Gaussian with correlation length 



(' 






(3.19) 



for the reduced coordinate i/i. Thus we need to approximate the variance a^. 
Applying Theorem [l] with n = 0, we have 



a^ = Var[/] < Cp (Ai + • • • + A^). 



(3.20) 
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Unfortunately, the Poincare inequality is a notoriously loose bound, so we are reluctant 
to simply plug in an estimate of Cp to approximate a^. Instead, we posit that for 
some constant a, 

a' = a(Ai + --- + A^), (3.21) 

where a < Cp^ and we can employ an estimate of Cp for the hypercube from [4], 

2^/rn 



Cp 



(3.22) 



To get a rough lower bound on a, we use the biased estimator of a"^ from the samples 
fj computed in Section 3.1, 



1 ^ 
JfZ2^f^ -/o)^ 



(3.23) 



j=i 



where /o is the empirical mean of the fj. Since M will generally be small due to 
limited function evaluations, we expect the bias to be significant enough to justify the 
bound 



a^ < a(Ai + -.- + A^). 



Combining (3.24) and (3.22), we have 

^2 



Ai + • • • + A^ 



< a < 



2^/m 



TT 



(3.24) 



(3.25) 



From here, we treat a as a hyperparameter for the correlation kernel (3.14), and we 
can use a maximum likelihood approach to set it. Note that the number of variables 
in the maximum likelihood is one, in contrast to n + 1 variables for the standard 
approach. Given a value for alpha, we set a^ by (3.21) and 

2 



T] 



a (An+i H h A^) 



(3.26) 



in (3.13). We now have all necessary quantities to build the kriging surface on the 
active subspace. 

3.3. A step-by-step algorithm. We summarize this section with an algorithm 
incorporating the previously described computational procedures given a function 
/ = /(x) and its gradient Vx/ = Vx/(x) defined on the hypercube [—1, 1]^. 

1. Initial sampling: Choose a set of M points Xj G [—1,1]^ according to 
the measure p(x). For each Xj, compute fj = /(xj) and Vx/j = Vx/(xj). 
Compute the sample variance a^. 

2. Gradient analysis: Compute the SVD of the matrix 



[Vx/l 



Vx/m] = WSV^, 



(3.27) 



and set A = S^. Choose a reduced dimension n < m according to practical 
considerations and the decay of A^. Partition W = [Wi W2] . 
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3. Reduced domain: Define the reduced coordinates yi = w^x for i = 
1, . . . , n. Construct a bounding box for tiie reduced coordinates 

y^ = minimum(wf x), y'^ = —y^^ (3.28) 

— 1<X<1 

with yi = [y[, . . . ,yl]^ and ju = [^x , . . . , ^^]^. Define the reduced domain 

3^ = {y : y^<y<yn}. (3.29) 



For points Xj in the initial design, define y^ = Wf Xj. See Figures 5.2a|5.2c 





Ai + • • • 


+ A„» 


< a 


~ TT 


Set the 


noise model parameter 










2 

V = 


a(A„+i + • 


■ + Xm) 


and correlation lengths 










^1 = 


S<-^- 


+ ■■ 


■ + Xm) 



4. Design on reduced domain: Choose a set of N points y^ G 3^. For 
each y/e, use a linear program solver to find an x* G [—1,1]^ that satisfies 
Yk = Wf X*. If no such x* exists, then remove yk from the design. 

5. Compute training data: For each remaining y/^, set 

fn,k = fniYk) = /(X*). (3.30) 

6. Train the response surface: Use a maximum likelihood method to find 
the hyperparameter a with bounds 

(3.31) 



(3.32) 



(3.33) 

for the product squared exponential correlation kernel on y. Apply standard 

kriging with the training data /n,/c- 
7. Evaluate the response surface: For a point x G [—1,1]^, compute y = 

Wf X, and evaluate the kriging surface /n(y) and its prediction variance. Set 

/(x) = /„(y). 
We conclude this section with summarizing remarks. First, the case of an infinite 
domain is similar with fewer manipulations for the reduced domain, since there is no 
need for a bounding box. However, it is less clear how to construct a good design on 
an unbounded domain without artificial boundaries. Second, the methods proposed 
for the gradient analysis can be improved substantially with better Monte Carlo meth- 
ods. Sequential sampling techniques combined with a measure of the stability of the 
computed subspaces could be a powerful approach for reducing the number of gra- 
dient evaluations. Alternatively, randomized algorithms for low rank approximation 
offer promise for reducing the number of gradient samples [6", TT . 

Third, as written, the choice of n requires some interaction from the user. We 
advocate such interaction since we have found that one can uncover insights into 
the function / by examining the A^ and the elements of w^. Fourth, as written, 
there is substantial freedom in choosing both the design sites on the reduced domain 
and the response surface. This was intentional. For our purposes, it suffices to use 
the gradient analysis to construct an approximation on the active subspace, but the 
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details of the approximation will require many more practical considerations than we 
can address here. We have chosen kriging primarily because of (i) the natural fit of 
the computed A^ to the correlation length parameters and training data noise model, 
and (ii) its flexibility with scattered design sites. However, many other options for 
approximation are possible including global polynomials, regression splines, or finite 
element approximations; Russi advocates a global quadratic polynomial [33 on the 
subspace. 



Fifth, with the gradient available for /, one could use (2.11) to obtain gradients 
with respect to the reduced coordinates. This could then be used to improve the 
response surface on the active subspace [22]. As also mentioned in J22], since we 
know a great deal about our correlation function, we could create designs that satisfy 
optimality criteria such as maximum entropy. However, this is not straightforward 
because the boundaries of the true reduced domain (not the bounding box) are difficult 
to set as constraints for an optimization procedure. And optimal designs on the 
bounding box are unlikely to be optimal on the true reduced domain. 

Finally, we note that the function evaluations fj could be better used to construct 
the response surface on the subspace. We have intentionally avoided proposing any 
strategies for such use and prefer instead to use them as a testing set for the response 
surface as detailed in the next section. 

4. Numerical examples. In this section, we apply the dimension reduction 
method to a physically motivated model of heat transfer on a three-dimensional tur- 
bine blade with interior cooling holes and a parameterized model for the heat flux 
on the surface. The problem is motivated by [31], where the authors use a Reynolds 
averaged Navier- Stokes (RANS) model of high speed flow over a turbine blade to 
study transition to turbulence of the flow field and its thermal properties. They are 
particularly interested in the variation of the size of the transition zone as model pa- 
rameters vary. The heat flux on the surface of the blade increases sharply in regions 
where turbulence begins to develop, but the location and size of the transition zone 
are very difficult to quantify. 

We model the uncertainty in the transition zone by a spatially varying, parameter- 
ized model for the heat flux on the surface of the blade geometry. The parameterized 
model is inspired by Karhunen-Loeve (KL) expansion models for stochastic processes, 
where the coefficients in a linear combination of spatially varying basis functions are 
interpreted as random variables with a given distribution; the coefficients in the heat 
flux model are assumed to vary independently. These independent coefficients com- 
prise the parameters of the heat flux model. The basis functions are approximate 
eigenfunctions of an exponential correlation kernel scaled by the square root of the 
corresponding eigenvalues. The correlation length parameters are chosen to admit 
rapid fluctuations in heat flux along the surface in the streamwise direction and slow 
variation in the spanwise direction; the rapid fluctuations model the uncertain tran- 
sition zone. 

The quantity of interest is the mean squared value of the computed temperature 
over the trailing 70% of the blade in the streamwise direction, where turbulence is 
most likely to occur. The parameters of the heat flux influence the temperature in the 
blade, which in turn influences the quantity of interest. We thus consider the mean 
squared temperature over the blade tail as a function of the heat flux parameters to be 
the high-dimensional function we wish to approximate. We can obtain the derivatives 
of the quantity of interest with respect to the heat flux parameters by solving the 



adjoint equation of the forward model as detailed in Section 4.2 
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4.1. Forward problem. The heat transfer model is a steady, nonhnear par- 
tial differential equation, where nonlinearity arises because thermal conductivity is 
a function of temperature. Let 5 C M^ with elements s = (si, 52,^3) represent the 



spatial geometry shown in Figure 5.5 The spatial domain comes from a NACA0018 
vertically symmetric airfoil with three interior holes; the domain is scaled to length 
1 in si, length 0.2 in 52, and length 2 in 53. We seek a temperature distribution 
T = T(s) that satisfies 

- Vs • (/^(T) VsT) =0 on 5, 

T = on Si, (4.1) 

-n • (/^(T) VsT) = ^ onS2. 

The thermal conductivity takes the form hz = hzo -\- tziT , where we choose kq = 0.1 
and f<ii = 0.01. The boundary Bi represents the interior cooling holes, and the 
homogeneous Dirichlet boundary conditions model a prescribed temperature. The 
boundary B2 is the external surface of the blade, and the Neumann boundary condi- 
tion g = ^(s,x) models the uncertain heat flux, where x are the model parameters. 
The function g takes the form 

m 
^(S,X) = /i(5i) + j{si)^Xi(j)i{si,S3). (4.2) 

i=l 

The parameters x = [xi, . . . ^Xm]^ are independent and uniformly distributed in the 
hypercube [—3, 3]^. The ^(si, 53) are products of eigenpairs for the correlation kernel 
iC : R2 X R2 ^ R given by 



K 



((.«,.«) 'K''^3^0) =-p(-^r^ 



.(1) J2) 



^3 



.(1) J2) 



(4.3) 



Note that the basis functions are only two-dimensional with variation in streamwise 
and spanwise directions; the correlation length parameters are subscripted to corre- 
spond to spatial coordinates. We choose 61 = 0.01 and 63 = 0.1 to model rapid 
streamwise and slow spanwise fluctuations. The mean function ja = /i(si) and scaling 
function 7 = 7(^1) are chosen so that /i + 57 and ja — 5j are prescribed to cover the 



heat flux ranges between turbulent and laminar flows; these are shown in Figure 5.4 



Figures 5.3a||5.3c show three realizations of heat flux on the turbine blade for three 



different sets of the parameters. 

The temperature T = T(s, x) depends on the parameters x of the heat flux. The 
quantity of interest is expressed as 

/(x) = I T^X{si>0.3)ds, (4.4) 

where X{si > 0.3) is an indicator function that returns 1 on the trailing region of the 
blade. 

4.2. Adjoint equation and derivatives. To obtain derivatives of / with re- 
spect to the heat flux parameters x, we flrst solve the adjoint equation, which takes 
the forward solution as inputs. The nonlinearity in the forward model causes the 
adjoint equation to be an advection-diffusion equation, where the velocity fleld is the 
gradient of the computed temperature. Also, the nonlinear quantity of interest causes 
a forcing term in the adjoint equation that depends on temperature. We do not derive 
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the adjoint equation, but instead refer interested readers to [5l[T9]. We seek a function 
Q = Q{s) that satisfies 

- Vs • (a^(T) VsQ) + /^i VsT • VsQ = 2TI{si > 0.3) on S 

Q = on Si (4.5) 

-n • {^{T) VsQ) = on S2 

With the computed solution Q, the partial derivatives of / can be computed as 

df 



^^. , ^l<P^ds, (4.6) 



where 7 and (j)i are defined in (4.2). 



4.3. Solvers and computations. The solutions to the forward and adjoint 
equations are approximated with a piecewise linear Galerkin finite element method 
(FEM), where we use streamwise upwind Petrov- Galerkin ^Mj to stabilize the adjoint 
equation. We implemented these methods using the Python interface for the FEn- 
iCS/DOLFIN suite of tools [26], where the weak forms of the equations are easily 
expressed in the unified form language (UFL [1 ). Through the DOLFIN interface, 
we employ the PETSc [3] Scalable Nonlinear Equations Solvers with a line search 
Newton method for the discretized forward problem and the PETSc LU direct solver 
for the discretized adjoint problem. 

The geometry and mesh are constructed with Gmsh [15j. The final mesh contains 
32076 nodes. Using the piecewise linear FEM, we thus have 32076 degrees of freedom 
in both the nonlinear forward problem and the linear adjoint problem. We visualize 
the solutions using Paraview [18j. 



The bases for the heat fiux in (4.2) are computed using our Random Field Sim- 
ulation package ^ for Matlab. This package uses the Matlab command eigs, which 
calls ARPACK [23] to compute approximate eigenpairs. These are computed on a 
two-dimensional grid with 101 nodes in the direction corresponding to streamwise 
and 31 nodes in the direction corresponding to spanwise. The computed functions are 
interpolated onto the upper and lower blade surfaces, where the heat fiux is defined. 

The computations were performed on a Dell Precision workstation with dual 
quad core processors and 12 GB of RAM. The DOLFIN library is written so that the 
simulations can run in parallel using all eight cores with a simple mpirun call. With all 
eight cores, each forward problem took approximately 0.4 minutes and each adjoint 
problem plus derivative computations took approximately 0.8 minutes. However, 
these codes were not optimized for performance, even within the DOLFIN framework. 



The solvers with mesh and inputs can be downloaded from https://github.com/ 
Ipaulcon/f enics-blad e, 

4.4. Dimension reduction study. Next we apply the dimension reduction 



method to the quantity of interest / from (4.4). All computations in this section are 



done with custom Matlab scripts on the machine described previously; the scripts can 
be downloaded from www. Stanford. edu/'^paulcon, 

The model for the heat flux is characterized hy m = 250 parameters. Often m 
is chosen to capture some proportion of the energy represented by the eigenvalues of 
the correlation kernel. We plot the eigenvalues for our case in Figure |5.6a[ where it 
is apparent that the decay is very slow. To estimate the energy retained with 250 
terms, we compute the first 500 out of the 3131 eigenvalues and fit an exponentially 
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decaying model. From this computation, we have retained approximately 91% of the 
energy in the field. However, m = 250 was chosen more for practical considerations 
of the computations. 

Given tti, we first obtain M = 750 initial design sites from the parameter space 
[—3,3]^. To do this, we sample sets of M TTi-dimensional points uniformly from the 
parameter space until we obtain a design whose minimum distance between points is 
relatively large; this heuristic is inspired by maximin designs [22 . For each design 
site, we compute the heat fiux and solve (i) the forward problem to get the quantity 
of interest and (ii) the adjoint problem to get the gradient. 

Next we compute the SVD of the matrix of gradient samples and examine the 
singular values. The matrix of gradient samples has dimension 250 x 750, so the 
SVD is trivially computable on the workstation. The squared singular values are 
plotted in Figure |5.6a| and compared to the eigenvalues of the correlation kernel. 
This comparison is only meaningful in the sense of dimension reduction. A common 
strategy for these sorts of problems is to reduce the dimension by truncating the KL 
expansion according to its eigenvalues. In this case, the slow decay suggests very 
little reduction is possible. However, the decay of the squared singular values from 
the gradient samples suggests that substantial reduction is possible for this particular 
quantity of interest. 



Figure 5.6b plots the components of the first two eigenvectors that define the 
one-dimensional and two-dimensional active subspace. Notice that many components 
are nearly zero, which suggests that the quantity of interest is relatively insensitive 



to the corresponding parameters. Figures 5.7a||5.7b show the initial M samples of / 



projected onto the one-dimensional and two-dimensional active subspaces. Already 
clear trends appear in the data when their coordinates are projected onto the proper 
subspaces. 

For the one-dimensional projection, we use five nodes on the reduced domain. For 
the two-dimensional projection, a we place grid of 7 x 7 points on the bounding box 



and follow the procedure described in Section [3.2.2| to compute the training data. The 
pruning procedure left 25 nodes on the two-dimensional subspace. Beyond the initial 
M samples, we compute / 25 more times for the training data. Figures |5.8a||5.8b| 
and |5.9a||5.9b| show one-dimensional and two-dimensional kriging response surfaces, 
respectively, with training data and prediction variances. 

4.5. Comparison with other dimension reduction strategies. We com- 
pare the kriging surface on the active subspace to two other approaches that include 
comparable dimension reduction: (i) a quadratic-mean kriging surface on a one/two- 
dimensional subspace, where the components are the first/second parameters xi and 
X2 corresponding to the largest energy modes in the heat flux model, and (ii) a 
quadratic- mean kriging surface on a one/two-dimensional subspace, where the com- 
ponents are the largest absolute values of the gradient at the nominal parameters 
X = 0. These are standard dimension reduction strategies. In both cases, maximum 
likelihood methods were used to tune a correlation length parameter in a Gaussian 
correlation for the kriging surface. 

We use the set of function evaluations fj generated while computing the gradient 
samples as a test set of size M = 750. With this test set, we compute the error for 



each response surface method. The statistics of these errors are shown in Table 4.1 
and normalized histograms of the errors are shown in Figures |5.10a|57lOb| It is clear 
that the dimension reduction method based on the active subspace defined by samples 
of the gradient is superior for this problem. 
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n = 1 


ASM 


KL 


SENS 


max. err. 
mean err. 


7.5639e-03 
1.9846e-03 


3.4637e-01 
7.6402e-02 


3.2485e-01 
7.0928e-02 


n = 2 


ASM 


KL 


SENS 


max. err. 
mean err. 


6.7920e-03 
1.8801e-03 


3.3185e-01 
6.7964e-02 


2.4453e-01 
6.0523e-02 



Table 4.1: The error of the reduced dimension response surfaces over the testing set. 
Methods labeled KL use the first parameters from the Karhunen-Loeve-type model 



for the heat flux boundary (4.2). Methods labeled SENS use the parameters with 



the largest absolute value of the gradient Vx/ at the origin. Methods labeled ASM 



use the active subspace defined by the first two eigenvectors of the matrix C in (3.2). 
Each method is tested in one and two dimensions. 



5. Summary & conclusions. Active subspace methods seek to approximate a 
multivariate function on a low-dimensional subspace of its domain that contains the 
majority of the function's variability. We have analyzed a theoretical best approxima- 
tion on the active subspace and derived error bounds. We have used these analyses 
to motivate a computational procedure for detecting the directions defining the sub- 
space and constructing a kriging response surface on the subspace to approximate the 
function over its domain. We have applied this procedure to nonlinear heat transfer 
model with a 250-parameter model of uncertain heat fiux on the boundary. We com- 
pared the active subspace method to two other approaches for dimension reduction 
with striking success. 

Loosely speaking, active subspace methods are appropriate for certain classes of 
functions that vary primarily in low- dimensional subspaces of the input. If there is no 
decay in the eigenvalues of C, then the methods will perform poorly; constructing such 
functions is not difficult. However, we have found many high-dimensional applications 
in practice where the eigenvalues do decay quickly, and the functions respond well to 
active subspace methods [71 [121 [101 [H] • Most of those applications look similar to 
the one presented in Section [4J where uncertainty in some spatially varying physical 
input can be represented by a series expansion, and the coefficients of the expansion 
are treated as random variables; such models arise frequently in UQ. 

The computational method we have proposed is ripe for improvements and ex- 
tensions. We have mentioned many such possibilities in Section |3.3[ and we are 
particularly interested in methods for using fewer evaluations of the gradient to com- 
pute the directions deffning the active subspace. We will also pursue strategies that 
make better use of the function evaluations acquired during the gradient sampling. 
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Fig. 5.1: The function /(xi,X2) = exp(0.7xi + 0.8x2) varies along the direction 
[0.7,0.3]-^, and it is flat in the direction [0.3,-0.7]-^. (Colors are visible in the elec- 
tronic version.) 



ACTIVE SUBSPACE METHODS 



19 






(a) 





(b) 






(c) 



Fig. 5.2: (5.2a) A three dimensional cube (blue) projected onto a two dimensional 
plane. The green lines represent the convex hull of the projected corners of the cube. 



The red lines show the bounding box of the convex hull in two dimensions. (5.2b) A 



design on the active subspace. Note that not every point in this design corresponds 



to a point in the original space. (5.2c) The final design for the active subspace is the 



set of points in the grid for which there exists a point in the full space. Note that 
we do not compute the convex hull in practice. (Colors are visible in the electronic 
version.) 




Fig. 5.3: Three realizations of the heat flux ( |4.2| ) on the surface of the turbine blade. 
(Colors are visible in the electronic version.) 
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Fig. 5.4: The mean /i and scaling 7 functions of the Karhunen-Loeve-type model for 



the heat flux (4.2). These functions are crafted to admit heat flux realizations that 



represent a transition to turbulence of the flow over the blade. The mean /i is shown 
in blue. The fluctuation /i ± 57 is shown in red. Five realizations of the heat flux 
down the spanwise centerline of the blade are shown in cyan. The upper surface of 
the blade is shown in black for reference. (Colors are visible in the electronic version.) 
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Fig. 5.5: The temperature distribution for one set of heat flux parameters. (Colors 
are visible in the electronic version.) 
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Fig. 5.6: (5.6a) Eigenvalues of the Karhunen-Loeve like expansion defini ng th e heat 
flux (see ( |4.2[ )) compared to the eigenvalues of the matrix C from (3.2). (5.6b) Com- 
ponents of the first two eigenvectors of C from (3.2) that define the active subspace. 
(Colors are visible in the electronic version.) 
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(a) Id projection 



(b) 2d projection 



Fig. 5.7: Function values fj = f{'^j) for Xj G [—3,3]^ projected onto the one- 
dimensional (5.7a) and two-dimensional ( |5.7b ) active subspace. (Colors are visible 
in the electronic version.) 




(a) Kriging prediction 



(b) Kriging std. 



Fig. 5.8: Kriging approximation /n(y) and its standard deviation on the one- 
dimensional active subspace. (Colors are visible in the electronic version.) 
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10 10 

(a) Kriging prediction 




(b) Kriging std. 



Fig. 5.9: Kriging approximation /n(y) and its standard deviation on the two- 
dimensional active subspace. (Colors are visible in the electronic version.) 




,o(error) 
(a) One-dimensional 



^(error) 
(b) Two-dimensional 



Fig. 5.10: Comparison of errors for subspace projections. One-dimensional projections 



are in Figure [5. lQa[ and two-dimensional projections are in Figure 5.1Qb "KL" refers 
to projecting onto the first two parameters of the heat flux model. "SENS" refers 
to projecting onto the parameters with the two largest components of the gradient. 
"ASM" is the projection onto the two-dimensional active subspace. (Colors are visible 
in the electronic version.) 
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